Introduction

In our first project, we analyzed daily weather patterns from data collected at a weather station in Central Park, New York City made available online by the National Oceanic and Atmospheric Administration. Through our analysis, we confirmed that there was a statistically significant rise in daily maximum temperatures in Central Park over the last 122 years.

We performed an ANOVA test on daily maximum temperature values over different periods of time and found statistically significant differences between our samples. This led us to create linear models for the change in daily maximum temperature over time, revealing statistically significant warming at an average rate of about 0.025 degrees Fahrenheit per year from 1900 to 2022. This is in fact a larger increase in temperature than the average global warming trend reported by [INSERT ORGANISATION NAME HERE!] (an average of 0.014 degrees Fahrenheit per year). However, since 1982, average temperatures in Central Park have increased significantly less than average global warming, perhaps because much of the development in New York City took place during the first half of the century.

We had more questions about relationships between weather and human activity, which are explored here in our Final Project.

Our Research Questions

For this project, we looked more directly at correlations between human activity and weather by incorporating new datasets related to population, air quality, crime (shootings and arrests), the stock market, and COVID-19.

  1. Do changes in Central Park maximum daily temperatures also correlate to measures of human activity?
  2. Are there statistically measurable changes in NYC air quality over time, and are they correlated to changes in daily maximum temperature observed in previous analysis?
  3. How do daily weather patterns correlate to other local human activity, such as crime, reported COVID cases, and stock market performance?

Central Park Warming and Human Activity

We developed several linear models of Central Park warming over time in Project 1 using all daily TMAX observations between 1900 and 2022. We observed statistically significant correlations between TMAX and both month (as a categorical variable), reflecting seasonal temperature variations, and year (as a numeric variable), reflecting a longer-term warming trend.

The resulting fit (using all daily data) has an \(R^2\) value of 0.775 and a slope of 0.025 degrees Fahrenheit per year, with all fit parameters’ p-values well below 0.05. The different intercepts for each level of the categorical variable (the twelve months of the year) indicated that January is the coldest and July the hottest month in Central Park, with an average difference in maximum daily temperature of approximately 46 degrees Fahrenheit in any given year over this window.

The two extremes and their linear models are plotted again here to illustrate the quality of fit.

However, we expect that time is just a proxy for the cause: human activity, namely construction in the local environment using materials that hold more heat than the natural environment (EPAb 2022), and increased consumption of energy and materials that release greenhouse gases over the course of their life cycle. We wanted to explore correlations with other, more direct proxies for human activity.

We sought other data sources related to these trends, for example local economic data (New York GDP), construction (sector revenue or number of new sites), New York City population, and local GHG emissions. While we found some sources for most of these, unfortunately none extended more than 50 years into the past with daily or annual observations except one: annual New York State population data (obtained from Macrotrends.com). Because this source was unknown, we validated the reported value for each decade against the corresponding data from the U.S. Census (it appears that values for intervening years were interpolated somehow).

We then created a new linear model for TMAX (for 1900-2021, due to the range of the population data), attempting to use New York State population instead of year: ‘TMAX ~ population + month’.

While we do observe a significant correlation (the p-value of the model’s population coefficient is well below the cutoff of alpha = 0.05), the fit appears not to be any better than the regression against year. This caused us to wonder whether a fit using month alone would be as good as a fit using month and year, or month and population.

The \(R^2\) value of the month-only model is a bit lower (though not much: 0.771 v. 0.773). To assess whether the model is distinct from those also including population or year, we ran ANOVA tests on each pair of nested models. According to the tests, the models are in fact significantly different (p-values of the chi-squared fits were well below an alpha of 0.05).

Nonetheless, given that the data are partially interpolated values and that we are looking at subtle, long-term trends over only 122 years at a single location (and that climate change is a complicated, average, and distributed phenomenon), we have probably exhausted our ability to build climate change models based solely on Central Park data.

For fun, we looked for correlations between temperature and other weather variables between 1900 and 2022.

We do see some correlation between TMAX and SNOW, and so add this into the model (TMAX ~ year + month + SNOW). The fit is slightly improved (with all coefficients’ p-values well below 0.05): year, month, and whether it is snowing appear to account for 77.6% (\(R^2 = 0.776\)) of the variability in daily maximum temperature in Central Park. (However, we note that 96 NAs for SNOW had to be removed to develop the model, so this result is likely not directly comparable to the other models.) We also considered additional local environmental correlations, described in the next section.

Local Weather and Air Quality

The Air Quality Index (AQI) is used for daily reporting of local air quality. It tells us how clean or polluted the air is, and what associated health effects might be a concern for the public. The higher the AQI value, the greater the level of air pollution and the greater the health concern. Outdoor concentrations of pollutants such as ozone, carbon monoxide, nitrogen dioxide, sulfur dioxide, and PM2.5/PM10 are measured at stations across New York City and reported to the EPA. The daily AQI is calculated based on these concentration values and stored within the EPA’s Air Quality System database.

Changes in urban life correlate with changes in air quality within that urban area. Sources of emissions such as traffic and burning of fossil fuels for energy generation can cause air quality to deteriorate. Emissions can also contribute to global warming by releasing more greenhouse gases into the atmosphere, thus increasing average temperatures. As more people migrate to urban areas, we will continue to see a deterioration in air quality unless mitigation measures are taken. Our goal for integrating this data is to study the effects of weather patterns on air quality, and to statistically verify changes in air quality over time in New York City.

The dataset contains about 7,000 observations collected from January, 2000 to October, 2022.

We start by looking at the distribution of our variable of interest: AQI.

From the histogram above, we can gauge that the distribution is slightly right-skewed. With the large number of observations in our dataset, we can assume normality for our modeling. The right-skewness is caused by days with unusually high AQI values.

The year-over-year growth rate was also calculated based on yearly average AQI and is depicted in the line plot below.
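As a rough illustration of the growth-rate calculation described above, the sketch below computes the percent change in yearly average AQI from one year to the next. The function name and the sample values are hypothetical; the report's actual yearly averages are not reproduced here.

```python
def year_over_year_growth(yearly_avg):
    """Map each year to its percent change relative to the previous year."""
    years = sorted(yearly_avg)
    growth = {}
    for prev, curr in zip(years, years[1:]):
        change = yearly_avg[curr] - yearly_avg[prev]
        growth[curr] = 100.0 * change / yearly_avg[prev]
    return growth

# Hypothetical yearly average AQI values, for illustration only.
example_avg = {2000: 80.0, 2001: 88.0, 2002: 77.0}
example_growth = year_over_year_growth(example_avg)
print(example_growth)
```

The first year has no predecessor, so it has no growth rate; an alternating sign in the result corresponds to an alternating pattern of increase and decrease.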

We can see an alternating pattern of increase and decrease in average AQI between each year from 2000 to 2009. After 2009, the pattern is broken but the variance continues.

In order to evaluate correlation between weather and air quality, we combined our dataset with the NYC weather data based on the date value in each. Dates without a matching air quality measurement are dropped. Subsequent models will be built using this merged dataframe.
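The merge step can be sketched as an inner join on the date field, so that only dates present in both sources survive. This is a minimal sketch with hypothetical field names (`date`, `AQI`, `TMAX`) and made-up rows, not the report's actual data frames.

```python
def merge_on_date(aqi_rows, weather_rows):
    """Inner join: keep only dates that appear in both inputs."""
    aqi_by_date = {row["date"]: row for row in aqi_rows}
    merged = []
    for weather in weather_rows:
        aqi = aqi_by_date.get(weather["date"])
        if aqi is not None:  # drop weather dates with no AQI measurement
            merged.append({**weather, **aqi})
    return merged

# Hypothetical rows, for illustration only.
aqi_rows = [{"date": "2000-01-01", "AQI": 55}, {"date": "2000-01-03", "AQI": 72}]
weather_rows = [
    {"date": "2000-01-01", "TMAX": 50},
    {"date": "2000-01-02", "TMAX": 60},
    {"date": "2000-01-03", "TMAX": 58},
]
merged_rows = merge_on_date(aqi_rows, weather_rows)
print(merged_rows)
```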

The first step to building linear models is assessing correlation between numerical variables in the data. Because the year variable in our dataset begins at 2000, using it directly would unnecessarily inflate the intercept of our linear models. Therefore, we shifted the variable to start at 0 (and continue to 22 to represent 2022).
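To make the effect of re-basing the year variable concrete, the sketch below fits a one-variable least-squares line twice, once against calendar years and once against years counted from 0. The data values and the `fit_line` helper are hypothetical. Shifting a predictor leaves the slope unchanged and only moves the intercept to a value that can be read as the fitted AQI in the first year.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical yearly AQI values, for illustration only.
years = [2000, 2001, 2002, 2003]
aqi = [80.0, 78.0, 77.0, 74.0]

raw_intercept, raw_slope = fit_line(years, aqi)
shifted_intercept, shifted_slope = fit_line([y - 2000 for y in years], aqi)
print(raw_intercept, raw_slope)
print(shifted_intercept, shifted_slope)
```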

The correlation is evaluated via a pairs plot, which depicts the correlation coefficient between numerical variables, and includes scatterplots of their relationships. The pairs plot uses the Pearson correlation method.
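For reference, the Pearson coefficient shown in each cell of a pairs plot can be computed as the covariance of two variables divided by the product of their standard deviations. The sketch below is a plain-Python version with hypothetical sample values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values: AQI falling as the (shifted) year increases.
year_index = [0, 1, 2, 3]
aqi_values = [80.0, 78.0, 77.0, 74.0]
r_year_aqi = pearson_r(year_index, aqi_values)
print(round(r_year_aqi, 3))
```

A value near -1 indicates a strong negative linear relationship, a value near 0 indicates little linear relationship.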

From the Pearson pairs plot above, we can see a moderately high, negative correlation value between year and AQI. This indicates that as the year increases, the AQI is actually dropping, resulting in better air quality in the city.

To better observe the effects of year on AQI, we can visualize the yearly average AQI.

The line plot confirms the correlation value we observed in the pairs plot. The average yearly AQI is indeed decreasing as year increases. Next, we build a linear model using year as a regressor to estimate daily AQI.

The results of our linear model reveal a significant value for both the intercept and year coefficient. The coefficient value for the year regressor indicates that for every one-year increase, the predicted daily AQI decreases by about 1.78 points. This supports the correlation coefficient we saw earlier between these two variables. The p-value of the F-statistic is also significant, but the \(R^2\) value of the model is only 0.28. Based on this model, the year explains only 28% of the variability in daily AQI measurements. This is not a strong fit. Looking at the scatterplot of the relationship can help explain the weak fit.

As we can see, there is a high degree of noise when observing daily AQI values at the yearly level. Although the plot displays a slightly downward trend in daily AQI, the noise around that trend degrades the model fit. This helps explain the results we received from our linear model.

Can we add more or different predictors to improve the fit? In our first project, we looked at linear trends of TMAX over time and determined a slight positive correlation observed over the years 1900-2022. We also utilized month as a categorical regressor to help explain the variance in daily maximum temperatures. Based on those results, we concluded that seasonality trends had a negative impact on model performance. Perhaps seasonality also plays a part in daily AQI measurements.

To refresh our memories, we included the monthly average daily maximum temperature. A seasonal trend can be observed as temperatures increase during summer months and decrease during winter months.

Plotting the average AQI by month, we observe seasonal trends. AQI values are generally high during winter and summer months, but relatively low for the months in between.

Based on this, we modify our last model and attempt to fit seasonality by adding month as a categorical regressor, along with our variable-of-interest from the last project - TMAX.

The regression coefficient for TMAX is significant and positive, with each degree Fahrenheit increase associated with an increase in predicted AQI of about 0.68 points. The regression coefficients for all month categories are also significant. In fact, every month has a negative impact on AQI when compared to January. September exhibits the largest difference, with a predicted AQI almost 44 points lower than January!

The p-value of the model’s F-statistic is also significant, indicating a significant relationship between our chosen predictors and the daily AQI value. However, the \(R^2\) for our model is only 0.149, which is weaker than our previous model. This indicates that only about 15% of the variation in daily AQI can be explained by TMAX and month.

The VIF scores for all regressors are in an acceptable range; however, the fit is still poor. It seems that, due to the seasonal nature of our time-series data, we cannot properly model daily AQI using linear regression. Perhaps a classification technique can be utilized to address the seasonal trends. More precisely, we can build a kNN model to classify the month based on daily AQI and maximum temperature values.

We start with plotting the relationship between our chosen predictors and add a layer to discern month within the plot.

We can make out minimal distinction of month from the scatterplot above, but the model will provide a more detailed analysis.

The first step involves scaling and centering our predictor values, as they are recorded in vastly different units of measurement. We also need to split our dataset into training and testing frames. We used a 4:1 split to satisfy this requirement.

To find the optimal k-value, we evaluated the model over a range of k from 1 to 21. Based on the plot above, it seems 13-nearest neighbors is a decent choice as it provides the greatest improvement in predictive accuracy before the incremental improvement trails off. We can build the kNN model using 13 as the k-value.
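The workflow described above (center and scale, split 4:1, then evaluate kNN over k from 1 to 21) can be sketched in plain Python as follows. This is an illustrative sketch, not the report's code: the data are synthetic, only two hypothetical month labels are used instead of twelve, and all function names are assumptions.

```python
import math
import random
from collections import Counter

def standardize(train, test):
    """Center and scale each feature using the training set's mean and std."""
    cols = range(len(train[0]))
    means = [sum(r[j] for r in train) / len(train) for j in cols]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in train) / (len(train) - 1))
            for j in cols]
    def scale(rows):
        return [[(r[j] - means[j]) / stds[j] for j in cols] for r in rows]
    return scale(train), scale(test)

def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, p, k) == y
               for p, y in zip(test_x, test_y))
    return hits / len(test_y)

# Synthetic (AQI, TMAX) points for two hypothetical classes, illustration only.
rng = random.Random(0)
rows = [([rng.gauss(70, 2), rng.gauss(38, 2)], "Jan") for _ in range(50)]
rows += [([rng.gauss(40, 2), rng.gauss(85, 2)], "Jul") for _ in range(50)]
rng.shuffle(rows)

cut = len(rows) * 4 // 5  # 4:1 train/test split
train, test = rows[:cut], rows[cut:]
train_x, test_x = standardize([x for x, _ in train], [x for x, _ in test])
train_y, test_y = [y for _, y in train], [y for _, y in test]

accuracy_by_k = {k: knn_accuracy(train_x, train_y, test_x, test_y, k)
                 for k in range(1, 22)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(best_k, accuracy_by_k[best_k])
```

Scaling with statistics taken from the training rows only keeps information from the test rows out of the fitted model. The synthetic classes here are deliberately well separated, so they say nothing about the accuracy obtained on the real data.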

The overall accuracy of our model is a relatively weak value of 0.257. This indicates that AQI and TMAX are not good predictors of month.

## Multi-class area under the curve: 0.644

A multi-class ROC evaluation on the test labels yields an AUC value of about 0.64, which is higher than expected based on the model’s accuracy value. Still, this falls below the AUC threshold of 0.8, so we do not consider it a good result.

Local Weather and Local Human Social and Economic Activity

The last question our team set out to address is whether statistically significant relationships exist between local weather and local human social and economic activity. These relationships have been explored by researchers in the past, with evidence suggesting weather has effects on an individual’s mood, thermal comfort level, and social interaction and can influence traffic, travel, public health, crime rates, and even stock prices (Horanont, 2013). Our team looked to expand on this prior work by focusing on specific observations in New York City, and combining these with our weather data. We decided to look specifically at the relationship between local weather and crime rates, stock prices, and public health.

We brought in two additional data sets to explore the relationship between weather and crime rates: data for all New York City shootings and for all New York City arrests with 5,844 observations between 2006 and 2021. Both of these data sets were available through NYC’s Open Data repository (https://opendata.cityofnewyork.us/).

To interrogate the relationship between weather and public health (beyond the air quality indicators explored in the preceding section), our team looked at COVID-19 case counts in New York City. The data were available from New York City’s Open Data repository. The data set consisted of 991 observations between February 29th, 2020 and present day.

For stock market data, we looked at the total daily trade for the Dow Jones Industrial Index. These data contained 10,822 observations dating back to December 25, 1979; however, trade volume was only available since 1988.

Shootings

We found records of all reported shootings, with at least one injured victim, occurring in New York City from 2006-2021. From here we extracted the daily number of shootings (and number of these designated as murders) over the entire time period, and merged this information with our weather data from the same time frame. We conducted some simple EDA. As indicated in the figure below, there is seasonal variation, with more shootings occurring in summer months than in the winter.

We then completed a simple correlation plot, and observed negative correlations between the number of shootings and both year and PRCP, and positive correlations with SNOW and TMAX.

We wanted to use all of these variables as regressors in a linear model. EDA revealed that the shooting data are not normally distributed, but are right-skewed (not surprising, given that they are bounded on the left side by zero), as apparent in the plot below.

However, since there were > 5,000 data points we created a linear model anyhow, using year and our main weather variables as regressors (we tried using month also, but most of the resulting coefficients for this factor variable were not statistically significant).

All fit parameters for this linear model (Shootings ~ year + SNOW + PRCP + TMAX) were significant (p-values of coefficients less than 0.05), and the multiple \(R^2\) was 0.129. This suggests that the number of shootings per day in New York City between 2006 and 2021 has been decreasing; that statistically significantly fewer shootings are likely when it is raining; and that shootings are more likely to be reported when it is snowing than when it is not, and for higher daily maximum temperatures.

These correlations could reflect trends in human predilection for gun violence as a function of weather (when it’s raining, do people stay home instead of going out with their guns? when it’s hotter, are they somewhat more likely to be riled up enough to fire them?). Or they could reflect different availability or ability of individuals to report shootings based on weather conditions (fewer witnesses when it’s raining? improved ease of spotting blood in the snow?). It would be difficult to understand any underlying causality without studying these individuals using social science methods (including interviews and further statistical analysis) or by bringing in additional data that might account for the number of people out on the street or where shootings are most likely to occur.

However, the residuals for this model were not quite evenly balanced, so we experimented with variable transformation, to see if we could generate a more predictive model, taking:
1. the square root of the daily number of shootings, and
2. the log of (1 + the daily number of shootings),
both of which generated a more normally distributed response variable.
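The two transformations listed above can be sketched as follows, together with a simple moment-based skewness measure to check that each one reduces right skew. The daily counts are hypothetical, and the `skewness` helper is an illustrative choice rather than the diagnostic used in the report.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2^1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical daily shooting counts, bounded below by zero and right-skewed.
daily_counts = [0, 0, 1, 1, 2, 3, 5, 12]

sqrt_counts = [math.sqrt(c) for c in daily_counts]   # transformation 1
log_counts = [math.log1p(c) for c in daily_counts]   # transformation 2: log(1 + count)

skew_raw = skewness(daily_counts)
skew_sqrt = skewness(sqrt_counts)
skew_log = skewness(log_counts)
print(round(skew_raw, 2), round(skew_sqrt, 2), round(skew_log, 2))
```

Adding 1 before taking the log keeps days with zero shootings defined, since the log of 0 is not.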

These dependent variable-transformed models are of similar quality, both with an adjusted \(R^2\) of 0.137, an improvement over the 0.128 value for the linear model with a non-transformed regressand (all models have F-statistic p-values of less than 0.05). Interestingly, the coefficient for SNOW was not statistically significant in the variable-transformed models. Given the relatively low \(R^2\) values, we did not attempt to use these models for prediction.

Arrest data

After looking at the relationship between reported shootings in NYC and local weather, we looked at arrests in New York (for any crime). As with the shootings data, these data were not previously analyzed by our group so we did some basic EDA to determine the distribution of the new data and identify correlations that exist.

For this and the later analyses of weather’s relationship with human activity, we used TMAX, the daily maximum temperature, and precipitation factor, a binary value for whether or not it rained or snowed on a given day. We incorporated month and year into these relationships as we saw seasonality effects and changes over time in our previous effort.

The histogram shows the distribution of the number of daily arrests in NYC to be slightly bimodal. However, because of the large sample size, we will treat it as a normal distribution.

We plotted arrests against TMAX, with points coded by precipitation factor and month. No clear patterns can be discerned in the plot below.

We performed a t-test and determined a significantly higher number of arrests on days without precipitation.
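A two-sample comparison of this kind can be sketched with Welch's unequal-variance t statistic. Whether the report used the Welch or the pooled-variance form is not stated, so the Welch form here is an assumption, and the arrest counts below are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    var_a = variance(sample_a) / len(sample_a)  # squared standard error of mean A
    var_b = variance(sample_b) / len(sample_b)  # squared standard error of mean B
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(sample_a) - 1) + var_b ** 2 / (len(sample_b) - 1)
    )
    return t_stat, dof

# Hypothetical daily arrest counts, for illustration only.
dry_days = [900, 950, 1000, 1050, 1100]
wet_days = [850, 900, 950, 1000, 1050]
t_stat, dof = welch_t(dry_days, wet_days)
print(t_stat, dof)
```

A positive statistic means the first group has the higher mean; the p-value would then come from a t distribution with the computed degrees of freedom, which this sketch leaves to a statistics package.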

We also looked at a correlation plot for weather variables and arrests. The strongest correlation exists between number of arrests and year, followed by month. Temperature appears to have no correlation with crime, while precipitation has a very minimal negative correlation with crime. However, we know these correlations might not tell the whole story, so we created a linear regression model that incorporates all of these variables.

Linear Model: NUM_ARREST ~ TMAX + PRCP_factor + year + month
Term                 Estimate Std. Error    t value   Pr(>|t|)
(Intercept)          97442.25   1530.816     63.654     0.0000
TMAX                     2.34      0.404      5.785     0.0000
PRCP_factor1           -46.13      7.368     -6.261     0.0000
year                   -47.96      0.760    -63.072     0.0000
month02                 15.16     17.434      0.870     0.3845
month03                  3.70     17.531      0.211     0.8330
month04                -52.88     19.351     -2.733     0.0063
month05                -69.01     21.337     -3.234     0.0012
month06               -125.43     23.518     -5.333     0.0000
month07               -160.82     25.009     -6.431     0.0000
month08               -131.47     24.384     -5.392     0.0000
month09               -144.15     22.672     -6.358     0.0000
month10                -82.77     19.830     -4.174     0.0000
month11               -135.52     18.046     -7.510     0.0000
month12               -205.30     17.138    -11.979     0.0000

The table above summarizes the linear model of arrests as predicted by the TMAX, precipitation factor, year, and month regressors. The model has an overall \(R^2\) value of 0.427, which is a weak overall fit for the data, but does indicate that these regressors explain some of the variability in the data. The coefficients for TMAX, precipitation factor, year, and most months are statistically significant.

There is a weak positive relationship between arrests and TMAX: for every degree increase in temperature there are about 2.34 more arrests. Precipitation factor had a somewhat larger effect on the number of daily arrests. The linear model shows a decrease of 46.13 arrests on days with precipitation, confirming what we saw in our t-test. The model also shows a decreasing number of arrests over time and that most months have a statistically significant relationship with the number of daily arrests.

Stock Market

Next, we looked at the effect weather has on the stock market, looking at the trade volume of the Dow Jones Industrial index in the 1990s. This time period was chosen due to the transition to digital stock trading with the rise of computers. We hypothesized that if weather had an effect on trade volume, the effect would be greatest for this time given the data available.

To start investigating this relationship, initial EDA was performed.

The histogram of the daily DOW stock volume shows the data is right skewed so we used a square root normalization of the volume, and reexamined the distribution.

The distribution of normalized Volume is considerably more normal, although there still appears to be a right-skew to the distribution. Given the number of observations, we will assume this is a normal enough distribution for further analysis.

A t-test was performed to compare the mean normalized stock volume on days with and without precipitation. The t-test did not reject the null hypothesis as the values are nearly identical regardless of precipitation. We then wanted to look at correlations between all the variables.

The only strong correlation that exists in this data set is the relationship between year and the stock volume. There are no other strong correlations present in the data. This is apparently supported by the scatter plot of Stock Volume vs TMAX, with points encoded by month and PRCP factor.

A linear model was created to highlight potential hidden relationships that may exist in the data.

Linear Model: StockVolume_norm ~ TMAX + PRCP_factor + year + month
Term                 Estimate Std. Error    t value   Pr(>|t|)
(Intercept)         -861.9944    11.8722    -72.606     0.0000
TMAX                   0.0052     0.0023      2.236     0.0254
PRCP_factor1          -0.0352     0.0437     -0.805     0.4209
year                   0.4352     0.0060     73.099     0.0000
month02               -0.1506     0.1032     -1.458     0.1448
month03               -0.2522     0.1017     -2.479     0.0132
month04               -0.1396     0.1124     -1.242     0.2142
month05               -0.3995     0.1215     -3.288     0.0010
month06               -0.4460     0.1349     -3.306     0.0010
month07               -0.4749     0.1435     -3.308     0.0009
month08               -0.4178     0.1391     -3.003     0.0027
month09               -0.2517     0.1285     -1.958     0.0503
month10                0.0814     0.1135      0.717     0.4734
month11                0.0488     0.1058      0.461     0.6446
month12               -0.0297     0.1007     -0.295     0.7678

The linear model, which incorporates all the stock volume data from 1988-2000, had statistically significant coefficients for TMAX, year, and some of the months. The relationship between TMAX and stock volume is misleading because of the difference in scale between the variables. The stock volume is on the scale of millions, so even though there is a statistically significant relationship, it is not a strong correlation. The relationship with TMAX is more likely related to the seasonality that is shown by the statistically significant months in the model, as the warmer months are more statistically significant. The overall model has an \(R^2\) value of 0.641, but that is likely primarily driven by the year variable, as indicated by the strong correlation above.

COVID Data

The last human activity we wanted to explore was public health, by looking at the relationship between weather and COVID-19 case counts. In our previous effort, we saw that the COVID-19 lockdown had an effect on the temperature in NYC, so here we are looking for weather’s effect on the number of cases in the city.

One notable flaw with this use of the dataset was that its case counts were tracked based on confirmation dates rather than test dates, which means the weather data does not always match up to the case number dates. We continued with the analysis on the assumption that same-day tests would outnumber longer-turnaround tests and that any relationships that exist would still be identifiable.

This was new data for our team, so we performed initial EDA followed by a linear model to examine the relationship between weather and COVID-19 case counts.

The COVID case count data was extremely skewed right so outliers were removed and a square-root transform was used to normalize the data.

A t-test comparing COVID-19 case counts on days with precipitation and days without precipitation was then performed. There was not a statistically significant difference in the mean number of COVID-19 case counts on days with and without precipitation. Boxplots of the precipitation and no precipitation COVID case counts are below.

Following the t-test, we wanted to look at the correlations of all the variables that can be included in a linear model.

The strongest correlation existed between year and the COVID-19 case counts, followed by a negative correlation between daily temperature and case counts. There was no correlation between precipitation and COVID case counts.

These variables were plotted in the scatter plot below to visualize the relationships prior to building the linear model.

We built a linear model that predicts the normalized COVID-19 case counts using TMAX, precipitation factor, year, and month, as the regressors.

Linear Model: sqrt_count ~ TMAX + PRCP_factor + year + month
Term                 Estimate Std. Error    t value   Pr(>|t|)
(Intercept)         -1.56e+04   1.22e+03    -12.739     0.0000
TMAX                -5.53e-02   5.57e-02     -0.993     0.3210
PRCP_factor1        -6.30e-01   9.33e-01     -0.676     0.4994
year                 7.74e+00   6.05e-01     12.788     0.0000
month02             -1.87e+01   3.03e+00     -6.159     0.0000
month03             -1.70e+01   2.99e+00     -5.674     0.0000
month04             -9.47e+00   3.15e+00     -3.010     0.0027
month05             -1.91e+01   3.41e+00     -5.618     0.0000
month06             -2.63e+01   3.76e+00     -7.004     0.0000
month07             -2.15e+01   3.94e+00     -5.461     0.0000
month08             -2.07e+01   3.89e+00     -5.318     0.0000
month09             -2.32e+01   3.62e+00     -6.395     0.0000
month10             -2.54e+01   3.43e+00     -7.396     0.0000
month11             -1.66e+01   3.22e+00     -5.150     0.0000
month12              2.09e+00   3.31e+00      0.630     0.5289

The table above summarizes the linear model predicting COVID-19 cases in New York City with weather, year, and month as regressors. The model has an adjusted \(R^2\) value of 0.347. This indicates the model is a weak fit, but it does explain some of the variation in the number of cases. The model coefficients are not all significant: neither TMAX nor precipitation factor is statistically significant in this model. Instead, year and most months are statistically significant, which drives the fit and predictive ability of the model.

Logistic Regression to Predict Precipitation

The last analysis performed does not directly address our original questions, but our team looked to use the relationships between human activity and weather to predict precipitation factor.

Our team built a logistic regression to predict precipitation factor using TMAX, year, month, and human activity via number of arrests and stock volume. These two variables were chosen for human activity because the previous linear models showed they had some statistically significant relationship to weather.

Logistic Model: PRCP_factor ~ TMAX + NUM_ARREST + StockVolume_norm + year + month
Term                 Estimate Std. Error    z value   Pr(>|z|)
(Intercept)           72.9312    20.4760      3.562     0.0004
TMAX                  -0.0057     0.0038     -1.503     0.1327
NUM_ARREST            -0.0009     0.0001     -6.669     0.0000
StockVolume_norm      -0.0103     0.0089     -1.151     0.2497
year                  -0.0359     0.0101     -3.546     0.0004
month02                0.2426     0.1681      1.443     0.1489
month03                0.1495     0.1693      0.883     0.3772
month04                0.4001     0.1854      2.158     0.0309
month05                0.3266     0.2052      1.592     0.1114
month06                0.5036     0.2230      2.259     0.0239
month07                0.3447     0.2418      1.425     0.1541
month08                0.2012     0.2363      0.851     0.3946
month09               -0.0494     0.2227     -0.222     0.8244
month10                0.0741     0.1933      0.384     0.7013
month11               -0.0123     0.1787     -0.069     0.9450
month12                0.0470     0.1695      0.277     0.7815

In the logit model, which predicts the log odds of precipitation factor, most of the coefficients are not statistically significant. Year, two separate months, and number of arrests are statistically significant. For every one-unit increase in year, the odds of precipitation are multiplied by \(e^{-0.0359} \approx 0.965\), a decrease of about 3.5%. This is not a particularly strong effect on the odds. For every additional arrest, the odds of precipitation are multiplied by \(e^{-0.0009} \approx 0.999\), a decrease of about 0.09%. This again is not a strong effect, but when accounting for the difference in scale for number of arrests per day, this is a more interesting relationship.
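Because logistic-regression coefficients are on the log-odds scale, exponentiating a coefficient gives the multiplicative change in the odds per one-unit increase in that regressor. The sketch below applies this to the two significant numeric coefficients from the table above; the 100-arrest comparison is an added illustration, not a result from the report.

```python
import math

# Coefficients copied from the logistic model table (log-odds scale).
coefficients = {"year": -0.0359, "NUM_ARREST": -0.0009}

# Odds ratio per one-unit increase, and the same effect as a percent change.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}
percent_change = {name: 100.0 * (ratio - 1.0) for name, ratio in odds_ratios.items()}

# Illustration only: combined effect of 100 additional arrests in a day.
odds_ratio_100_arrests = math.exp(100 * coefficients["NUM_ARREST"])

for name in coefficients:
    print(name, round(odds_ratios[name], 4), round(percent_change[name], 2))
print(round(odds_ratio_100_arrests, 3))
```

Scaling the arrest coefficient by a realistic day-to-day difference in arrests shows why a per-arrest effect that looks negligible can still matter.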

Unfortunately, this logistic regression is not a great model to predict precipitation. The overall accuracy of the model, using a default cutoff of 0.5, is 0.64. This is not a great improvement over a null model. Additionally, the precision for this model is only 0.03 and the recall rate is only 0.51.
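For reference, accuracy, precision, and recall at a 0.5 cutoff can be derived from the confusion-matrix counts as sketched below. The labels and predicted probabilities are hypothetical and do not reproduce the figures quoted above.

```python
def classification_metrics(actual, probabilities, cutoff=0.5):
    """Accuracy, precision, and recall for binary labels at a probability cutoff."""
    predicted = [1 if p >= cutoff else 0 for p in probabilities]
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical precipitation labels (1 = precipitation) and model probabilities.
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.45]
metrics = classification_metrics(actual_labels, predicted_probs)
print(metrics)
```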

The McFadden pseudo-\(R^2\) value for this model is 0.01, which shows this model explains almost none of the variability in the precipitation.

We also plotted the receiver operating characteristic (ROC) curve to measure the true positive rate versus the true negative rate of this model.
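McFadden's measure compares the log-likelihood of the fitted model with that of an intercept-only (null) model: one minus their ratio. The sketch below computes it from hypothetical labels and predicted probabilities; the helper names are assumptions.

```python
import math

def log_likelihood(actual, probabilities):
    """Bernoulli log-likelihood of binary labels given predicted probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for y, p in zip(actual, probabilities))

def mcfadden_r2(actual, probabilities):
    """1 - LL(model) / LL(null), where the null model predicts the base rate."""
    base_rate = sum(actual) / len(actual)
    ll_null = log_likelihood(actual, [base_rate] * len(actual))
    return 1.0 - log_likelihood(actual, probabilities) / ll_null

# Hypothetical labels and probabilities, for illustration only.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_probs = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.45]
pseudo_r2 = mcfadden_r2(labels, model_probs)
null_r2 = mcfadden_r2(labels, [0.5] * len(labels))
print(round(pseudo_r2, 3), null_r2)
```

A model that predicts no better than the base rate scores 0, so a value such as 0.01 indicates almost no improvement over the null model.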

## Area under the curve: 0.575

The area under the ROC curve is 0.575. This confirms the model is not a good fit as it is only slightly greater than the null AUC of 0.5.
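The area under the ROC curve equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below uses that pairwise definition on hypothetical data; it is an illustration of the quantity, not the routine used to produce the output above.

```python
def roc_auc(actual, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(actual, scores) if y == 1]
    negatives = [s for y, s in zip(actual, scores) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and scores, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.3, 0.45]
auc_value = roc_auc(y_true, y_score)
auc_constant = roc_auc(y_true, [0.5] * len(y_true))
print(auc_value, auc_constant)
```

A model that assigns every case the same score gets 0.5, which is the null AUC referred to above.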

However, this logistic regression appears to confirm a relationship between number of arrests and precipitation. This relationship would be something of interest to explore further in future study.

Conclusions

Overall, we have identified statistically significant correlations between weather, air quality, and human activity data from NYC, but none of our models demonstrate high predictive potential.

We were unable to develop a model for Central Park warming on a century scale, due to a lack of quality data over this time frame that accounted for relevant human activities (such as expansion of the built environment, local population or greenhouse gas emissions on a daily basis). This work has given us a new appreciation of just how complicated weather and climate models must be, especially if they are to be used predictively.

In our air quality analysis, our hypothesis that a correlation would exist between daily weather and air quality variables was ultimately proven wrong. We observed trends of declining AQI over time, but the explanation of variance from our model results was not strong enough to deem the model a good fit. Similarly, a linear model predicting AQI based on the categorical month variable, along with TMAX, also resulted in a poor fit.

We determined that the relationship between air quality and global warming is difficult to model using linear techniques due to seasonal trends in the variables. Our attempt to model the effect of these trends using kNN also resulted in a poor-fitted model. Ultimately, a different type of model would be required to address the seasonal component.

Also, changes in climate are slow to take effect. An increase in emissions does not necessarily lead to an increase in temperature on the same time scale. All of these effects would need to be taken into consideration for an effective analysis.

We did identify correlations between daily weather and local human activity in some areas. Both crime and stock market trade volume had statistically significant correlations with daily weather variables in our linear models. Crime (daily numbers of shootings and arrests) is correlated with both temperature and precipitation, while stock market trade volume is related to temperature. However, models based on these correlations are not strong or complete enough to be predictive.

The observed correlations between crime and weather were the strongest we found in this analysis. This represents a valuable area for future study in the context of the changing weather patterns explored in our earlier project.

There were notable limitations to the methods in this analysis. One key limitation that affected the analysis of public health was the availability of essential data. The COVID-19 case data was based on the dates when positive cases were confirmed, rather than the dates when tests were taken. Because test and confirmation dates are not always the same, this limited our ability to assess relationships that existed on the day of an individual’s test. Looking for alternative data sources to explore this relationship would be an interesting area for a future project. From our results, we also conclude that linear models might not be optimal for representing data with seasonal trends.

These questions are only some of the important questions that should be asked about the relationship between humans, weather, climate, and the environment. As climate change continues, it will be increasingly important to continue to study its effects on people, from the global scale down to local communities around the world.

References

City of New York. NYPD Shooting Incident Data (Historic). Retrieved December 6, 2022, from https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Historic-/833y-fsy8

Lindsey, R.; Dahlman, L. (2022, June 28). Climate change: Global temperature. Climate.gov. Retrieved December 11, 2022, from https://www.climate.gov/news-features/understanding-climate/climate-change-global-temperature

NYC Environment and Health Data Portal, Tracking changes in New York City’s sources of air pollution, published online April 12, 2021, retrieved December 6, 2022, from https://a816-dohbesp.nyc.gov/IndicatorPublic/beta/data-stories/aq-cooking/

NOAA National Centers for Environmental Information, Monthly Global Climate Report for Annual 2021, published online January 2022, retrieved on November 2, 2022 from https://www.ncei.noaa.gov/access/monitoring/monthly-report/global/202113

NOAA National Centers for Environmental Information, “Climate Data Online.” Custom data requests and downloads retrieved on October 1 (Central Park data) and October 16, 2022 (JFK data) from https://www.ncdc.noaa.gov/cdo-web/

Parida, B. R., Bar, S., Kaskaoutis, D., Pandey, A. C., Polade, S. D., & Goswami, S. (2021). Impact of COVID-19 induced lockdown on land surface temperature, aerosol, and urban heat in Europe and North America. Sustainable cities and society, 75, 103336. https://doi.org/10.1016/j.scs.2021.103336

Storms, C. (2015). New York City Panel on Climate Change 2015 Report: Executive summary. Ann NY Acad Sci, 1336, 9-17. Retrieved on October 29, 2022 from https://nyaspubs.onlinelibrary.wiley.com/doi/10.1111/nyas.14008

United States Environmental Protection Agency (EPA), Air Data Basic Information, retrieved December 8, 2022, from https://www.epa.gov/outdoor-air-quality-data/air-data-basic-information

United States Environmental Protection Agency (EPAa), Climate Change Indicators: U.S. and Global Precipitation, published online August 2022, retrieved on November 2, 2022 from https://www.epa.gov/climate-indicators/climate-change-indicators-us-and-global-precipitation

United States Environmental Protection Agency (EPAb), Learn About Heat Islands. Retrieved on November 3, 2022 from https://www.epa.gov/heatislands/learn-about-heat-islands

Horanont, T., Phithakkitnukoon, S., Leong, T. W., Sekimoto, Y., & Shibasaki, R. (2013). Weather effects on the patterns of people’s everyday activities: a study using GPS traces of mobile phone users. PloS one, 8(12), e81153. https://doi.org/10.1371/journal.pone.0081153
